ABSTRACT

This  thesis  asserts  that  Cluster  Analysis,  or  Numerical 
Taxonomy,  has  many  potential  applications  in  the  field  of 
international  relations.   It  demonstrates  two  representative 
applications.   Both  examples  treat  the  nations  of  the  world 
as  objects  having  measurable  attributes,  and  both  examples 
use  selected  attributes  to  produce  a  dendrogram  (or 
hierarchical  classification)  of  the  nations  of  the  world. 
In  one  example  this  dendrogram  is  used  to  objectively  group 
the  nations  into  blocs  based  on  external  economic  ties.   In 
the other example the dendrogram is used to highlight interactions
among five attributes, ignoring the identity of
individual  nations,  the  same  way  a  scatter  plot  highlights 
interactions  between  two  variables. 
I.   BACKGROUND

A.   CLUSTER  ANALYSIS  DEFINED 

The  subject  of  this  thesis  is  a  group  of  mathematical 
techniques  known  collectively  as  either  Cluster  Analysis  or 
Numerical  Taxonomy.   The  terms  are  equivalent.   Their  formal 
definition, paraphrased from Ref. 6, is actually a sequence
of definitions, as follows:

classification system - a set of subsets of a set of objects
    which conveys some information about the objects

taxonomy - the science of constructing classificatory systems

Cluster Analysis or Numerical Taxonomy - the science of
    constructing mathematical classificatory systems

In less formal terms, Cluster Analysis includes all mathematical
methods of classifying objects into sets so as to
represent complex data in a simpler way that will serve as
a fruitful source of hypotheses.

Cluster Analysis is a two-stage process. The first stage
is to choose quantifiable attributes that describe the objects,
and then use these attributes to measure the pair-wise dissimilarity
among the objects. The second stage is to represent
these dissimilarities by an appropriate classificatory system
or display.

The input to Cluster Analysis is normally an n x m matrix
of data: measurements of m attributes for each of n objects.

The output from Cluster Analysis is normally one of three
displays:

A  hierarchical  classification,  commonly  called  a  tree 
diagram  or  dendrogram; 

A  partition  of  the  objects  into  mutually  exclusive  sets, 
each  set  described  by  a  "profile"  or  vector  of  m  average 
attribute  values; 

A  "clumping"  of  the  objects  into  sets  that  may  overlap,  each 
set  again  described  by  a  profile. 

The value of these outputs is that they summarize the original
data objectively and they tend to highlight subtle interactions
in the original data, enabling a user to formulate reasonable
hypotheses about these interactions.

B.   PREVIOUS APPLICATIONS OF CLUSTER ANALYSIS

Cluster  Analysis  was  developed  in  the  eighteenth  century 
by botanists and biologists attempting to inject more objectivity
into their classifications of plant and animal specimens
(the familiar phylum-genus-species scheme). Subsequently the
same technique was used by geologists. Most recently, Cluster
Analysis has found numerous applications in the social sciences,
particularly in psychology. Reference 1 describes a
representative application.


II.   PROPOSED  APPLICATIONS  OF  CLUSTER  ANALYSIS 
IN  THE  STATE  DEPARTMENT 

The United States State Department is currently trying
to revitalize its policy-making and resource-allocating
functions, a la the Defense Department metamorphosis under
Robert McNamara. This revitalization effort has been underway
for eight years now. In that time a plethora of "master
plans" for the incorporation of Systems Analysis in the State
Department has been published (References 2 and 3 are
representative).

This  paper  does  not  propose  another  master  plan,  but 
merely  suggests  that  a  single  existing  statistical  analysis 
technique  has  useful  applications  within  the  State  Department. 
The  existing  technique  is  Cluster  Analysis,  and  the  potential 
applications  within  the  State  Department  are  described  and 
demonstrated  in  the  pages  that  follow. 

A.   FIRST  APPLICATION:   TO  HIGHLIGHT  INTERACTIONS  OF  VARIABLES 
1.   General Description

In  this  application,  Cluster  Analysis  highlights  the 
interactions  of  several  variables  the  same  way  a  scatter  plot 
would  for  two  variables.   It  inputs  an  n  x  m  matrix  of  data 
(n   countries,  each  described  by  m  variables)  and  outputs  a 
dendrogram. The dendrogram itself says nothing about interactions
among the variables. But it is a simple matter to
select a clustering level (where k = the number of clusters)
and plot the distribution of the m variables within each
cluster.   Comparisons  among  these  plots  should  bring  out  all 
significant  interactions  among  the  variables.   In  particular, 
it  should  highlight  mutual  interaction  among  three  variables 
or  even  among  four  variables  just  as  easily  as  it  highlights 
a  two-way  interaction.   This  is  a  potential  not  shared  by 
factor  analysis  and  regression  techniques. 
2.   Scenario  for  Demonstration 

The  United  States  Constitution  lists  Freedom  of  the 
Press, Freedom of Speech, and Freedom of Assembly as inalienable
human rights. Although one might argue that these
precise  terms  have  been  eclipsed  by  communications  technology, 
most  of  the  Western  world  would  agree  that  "free  and  facile 
communication  among  the  people"  is  an  essential  quality  in  a 
free  and  productive  society.   Having  tentatively  accepted 
this  thesis,  a  sociologist  or  political  scientist  might  well 
wish  to  dissect  the  concept  of  "free  and  facile  communication 
among the people," to define it in quantifiable terms. Moreover,
a policy planner in the State Department might well wish
to  go  one  step  further:   to  use  this  quantitative  definition 
in  a  comparative  study  of  the  countries  of  the  world.   Such 
comparisons  are  made  every  day  with  respect  to  Gross  National 
Product,  Life  Expectancy,  etc.   Why  not  also  tabulate  a  FFCAP 
(free  and  facile  communication  among  the  people)  Index? 

Assuming  that  the  State  Department  considered  it 
worthwhile  to  develop  such  an  index,  they  would  probably  task 
a  team  of  their  sociologists  to  propose  a  list  of  measurable 
factors that either contribute to or detract from "free and
facile  communication  among  the  people."   This  team  would 
certainly  appreciate  the  practical  advantages  of  building 
this  list  around  statistics  that  had  already  been  measured, 
and  using  the  existing  data  to  continually  validate  their 
theories  against  the  real  world. 

Thus  they  would  probably  be  faced,  early  on  in  their 
proceedings,  with  a  large  volume  of  existing  data  to  be 
perused,  or  analyzed  in  a  very  general  sense.   At  this  point 
they  could  profit  greatly  from  applying  Cluster  Analysis  to 
highlight  the  interaction  of  variables. 
3.   Choice of Data

To  demonstrate  this  application,  the  author  has  usurped 
the  role  of  State  Department  sociologist  and  selected  the 
following statistics as "measurable factors that either contribute
to or detract from free and facile communication among
the people":

Variable  1.   Concentration  of  Population  in  Cities,  1965 

Variable  2.   Radios  per  1000  Population,  1965 

Variable  3.   Students  in  Higher  Education  (Third  Level) 
per  One  Million  Population,  1965 

Variable  4.   Ethno-Linguistic  Fractionalization 

Variable  5.   Press  Freedom  Index,  1965 
See  Appendix  A  for  definitions  of  these  variables. 

It  is  readily  admitted  that  this  list  is  not  as 
complete  as  it  should  be.   In  particular,  "Literacy  Rate"  is 
conspicuous  by  its  absence,  and  some  measure  of  newspaper 
circulation seems a necessary counterpart to Variable 2.
The reason for such omissions was unavailability of data,
an affliction that is widespread among independent researchers
but not shared by insiders at the State Department.

The  unavailability  of  data  to  this  researcher  imposed 
another  artificiality  on  this  demonstration  besides  the  omission 
of  some  desirable  variables.   Table  I  displays  values  of  the 
aforementioned  variables  for  only  85  of  the  136  nations  in  the 
world.   It  was  necessary  to  delete  the  other  nations  because 
of  excessive  missing  data. 

4.   Choice of Dissimilarity Coefficient

This  section  describes  the  process  of  converting  the 
data  in  Table  I  to  a  matrix  of  Dissimilarity  Coefficients. 

The  first  decision  point  was  to  specify  a  formula  for 
the  Dissimilarity  Coefficient  (DC).   The  DC  is  a  single  real 
number  specifying  the  amount  of  dissimilarity  between  Country 
A  and  Country  B,  obtained  by  somehow  combining  the  five  data 
points  describing  each  country.   There  are  many  different 
formulas  for  transforming  these  ten  data  points  to  a  single 
DC.   Cormack  presents  a  concise  but  comprehensive  summary  of 
all  the  common  formulas  in  Table  1  of  Ref.  4. 

In  the  situation  at  hand  it  was  decided  to  use  a 
Euclidean  Distance,  standardized  by  range.   That  is, 


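DC(A,B) = sqrt( sum over v of ( (x_Av - x_Bv) / R_v )^2 )

where x_Av is the value of Variable v for Country A and R_v is the
range (maximum minus minimum) of Variable v across all countries.

Euclidean Distance was preferred to others simply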
because  of  its  geometric,  intuitive  appeal. 

On  the  other  hand,  a  more  elaborate  rationale  went 
into  the  decision  to  standardize  by  range.   First  of  all  it 
was  decided  that  some  type  of  standardizing  (scaling)  would 
be  appropriate.   Most  of  the  literature  argues  convincingly 
that  scaling  is  inappropriate  when  the  difference  in  scale 
between  two  variables  may  be  intrinsic;  but  no  such  intrinsic 
differences  seemed  likely  in  the  five  variables  used  here. 
Moreover,  using  unstandardized  Euclidean  Distance  in  this 
situation  would  clearly  result  in  the  DC  being  driven  by 
Variables 2 and 3 while Variables 1 and 4 would be virtually
ignored,  and  there  is  no  a  priori  reason  to  intentionally 
emphasize  one  variable  over  another  in  this  application. 

Once some type of scaling had been decided upon, there
were several types to choose from, namely scaling by standard
deviation, scaling by range, and scaling by some other
heterogeneity  measure  (see  page  326  of  Ref.  4  for  a  comparative 
discussion).   Since  the  data  distributions  were  mixed  there 
was  no  compelling  theoretical  reason  for  choosing  one  scaling 
method  over  another.   Eventually,  scaling  by  range  was 
selected  for  its  simplicity.   It  would  be  interesting  to  see 
if  scaling  by  standard  deviation  would  significantly  change 
the  end  result  (dendrogram)  from  that  obtained  here;  but  this 
was  not  done. 

Using  Euclidean  Distance  standardized  by  range,  the 
425  data  points  in  Table  I  were  transformed  into  a  matrix  of 
3570  Dissimilarity  Coefficients. 
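
The conversion described above can be sketched in a few lines of Python. This is an illustration only, not the program listed at the end of the thesis: the function name range_scaled_dc and the toy values are hypothetical, and the sketch assumes the usual reading of "Euclidean Distance standardized by range" (each variable's difference is divided by that variable's range before squaring).

```python
from itertools import combinations
from math import sqrt

def range_scaled_dc(data):
    """Return {(i, j): DC} for every pair of rows in `data`.

    `data` holds one row per country and one column per variable.
    Each difference is divided by the variable's range (assumed
    reading of "standardized by range") so that no single variable
    dominates the distance.
    """
    n_vars = len(data[0])
    ranges = [max(col) - min(col) for col in zip(*data)]
    if any(r == 0 for r in ranges):
        raise ValueError("a variable with zero range cannot be scaled")
    dc = {}
    for i, j in combinations(range(len(data)), 2):
        dc[(i, j)] = sqrt(sum(
            ((data[i][v] - data[j][v]) / ranges[v]) ** 2
            for v in range(n_vars)
        ))
    return dc

# Hypothetical values: three countries, two variables on very different scales.
toy = [[0.05, 100.0], [0.15, 600.0], [0.25, 1100.0]]
toy_dc = range_scaled_dc(toy)

# 85 countries taken two at a time give the 3570 coefficients cited above.
n_pairs_85 = 85 * 84 // 2
```

Without the division by range, the second toy variable would contribute differences in the hundreds while the first contributed differences of a few hundredths, which is the imbalance the text warns about.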
5.   Choice of Algorithm

There  are  several  methods  of  proceeding  from  the  matrix 
of Dissimilarity Coefficients (DC) to a partition (or a dendrogram
of partitions) of the countries. Choosing one was the
next decision point in this demonstration. All methods in
general use fall into one of three categories:

a.  Agglomerative  Algorithms  -  a  series  of  successive 
fusions  of  the  85  countries  into  groups. 

b.  Divisive  Algorithms  -  a  partitioning  of  the  complete  set 
of  countries  successively  into  finer  partitions. 

c.  Reallocative  Algorithms  -  successive  reallocation  of 
individual  countries  between  the  sets  of  some 
initial  partition. 

It was first decided that a reallocative algorithm would be
inappropriate because it requires an initial partition, and
there was no a priori evidence to suggest what that partition
should be. Between the two remaining alternatives, theoretical
considerations did not yield a preference: in nearly every
case, agglomerative and divisive algorithms produce identical
dendrograms. The agglomerative algorithm was selected because
its details have been more thoroughly documented in the
literature.

Within the family of agglomerative algorithms there
are at least eight documented alternative "sorting strategies"
or formulas for determining the DC between cluster (k) and
cluster (ij), using the DC between cluster (k) and cluster (i)
and the DC between cluster (k) and cluster (j). If the matrix
of Dissimilarity Coefficients contains natural and compelling
clusters, each having strong internal cohesion and strong
external isolation, then the choice of sorting strategy is
not  a  critical  one.   But  if  natural  and  compelling  clusters 
are  not  present,  different  sorting  strategies  can  produce 
markedly  different  dendrograms.   The  eight  common  sorting 
strategies  are  explained  in  Chapter  3  of  Ref.  4.   Of  those 
eight,  the  Complete  Linkage-Furthest  Neighbor  sorting  strategy 
and  the  Single  Linkage-Nearest  Neighbor  sorting  strategy 
represent  the  extremes.   The  others  may  be  thought  of  as 
compromises  between  these  two.   The  Complete  Linkage-Furthest 
Neighbor  sorting  strategy  can  be  expressed  mathematically  as 

DC(k,ij)  =  max(  DC(k,i),  DC(k,j)  ) 

It  produces  compact  clusters  having  high  internal  cohesion; 
but  it  may  sacrifice  external  isolation  when  natural  and 
compelling  clusters  are  not  intrinsic  in  the  data.   At  the 
other  extreme,  the  Single  Linkage-Nearest  Neighbor  sorting 
strategy  can  be  expressed  mathematically  as 

DC(k,ij)  =  min(  DC(k,i),  DC(k,j)  ) 

It  tends  to  produce  chains  of  objects  in  addition  to,  or 
instead  of,  compact  clusters,  especially  when  natural  and 
compelling  clusters  are  not  intrinsic  in  the  data.   In  some 
applications  this  tendency  is  desirable. 
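
The two update rules above differ only in whether max or min is applied when clusters (i) and (j) are fused. The following Python sketch makes that concrete. It is an assumption-laden illustration rather than the thesis program: the function name agglomerate, the dictionary layout, and the four-object toy data are all hypothetical.

```python
from itertools import combinations

def agglomerate(dc, n, strategy=max):
    """Successively fuse the two clusters with the smallest DC.

    dc:       {(a, b): DC} for every pair of the n objects 0..n-1
    strategy: max for Complete Linkage-Furthest Neighbor,
              min for Single Linkage-Nearest Neighbor
    Returns the fusions in order as (level, members_a, members_b).
    """
    members = {i: [i] for i in range(n)}
    dist = {tuple(sorted(pair)): value for pair, value in dc.items()}
    fusions = []
    next_id = n
    while len(members) > 1:
        # Smallest DC wins; ties are broken by cluster id for determinism.
        (i, j), level = min(dist.items(), key=lambda item: (item[1], item[0]))
        fusions.append((level, sorted(members[i]), sorted(members[j])))
        merged = members.pop(i) + members.pop(j)
        # Keep distances that do not involve the two fused clusters.
        new_dist = {pair: value for pair, value in dist.items()
                    if i not in pair and j not in pair}
        # DC(k,ij) = strategy(DC(k,i), DC(k,j)) for every remaining cluster k.
        for k in members:
            d_ki = dist[tuple(sorted((k, i)))]
            d_kj = dist[tuple(sorted((k, j)))]
            new_dist[(k, next_id)] = strategy(d_ki, d_kj)
        members[next_id] = merged
        dist = new_dist
        next_id += 1
    return fusions

# Hypothetical objects placed on a line with steadily widening gaps.
positions = [0, 2, 5, 9]
toy_dc = {(a, b): abs(positions[a] - positions[b])
          for a, b in combinations(range(4), 2)}

complete = agglomerate(toy_dc, 4, strategy=max)
single = agglomerate(toy_dc, 4, strategy=min)
```

On this toy data the complete-linkage run forms two compact pairs before joining them, while the single-linkage run adds one object at a time to a growing chain, which is the contrast drawn in the text.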

For  the  demonstration  at  hand  compact  clusters  were 
considered far more desirable than chains, and the Complete
Linkage-Furthest  Neighbor  sorting  strategy  was  selected.   It 
would  be  interesting  to  see  if  one  of  the  compromise  sorting 
strategies, such as Group Average, would significantly change
the  end  result  (dendrogram)  from  that  obtained  here;  but  this 
was  not  done. 

The  dendrogram  in  Drawing  1  was  obtained  using  the 
Complete  Linkage-Furthest  Neighbor  sorting  strategy  in  an 
agglomerative  algorithm.   The  computer  program  is  listed  at 
the  end  of  this  thesis  for  information.   It  should  be  noted 
that the matrix of Dissimilarity Coefficients was standardized
to  the  (0.0,  100.0)  interval,  using  scaling  by  range,  before 
they  were  input  to  the  clustering  algorithm.   But  such 
standardizing  was  made  for  computational  convenience  only. 
Its  single  effect  was  a  monotonic  transformation  of  the 
numerical  scale  across  the  top  of  the  dendrogram.   The  shape 
of  the  dendrogram  was  unaffected. 

6.   From  Dendrogram  to  Cluster  Profiles 

Having  obtained  the  dendrogram  in  Drawing  1,  and 
recalling that the purpose here was to highlight the interactions
among variables, it remained only to select a level
of  clustering,  identify  the  partition  of  countries  there, 
plot  the  distributions  of  variables  within  each  cluster,  and 
compare  these  plots.   But  several  iterations  of  this  process 
were  required  before  the  interactions  among  variables  began 
to  appear. 
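
Selecting a level of clustering amounts to stopping the agglomerative process when k clusters remain. A minimal sketch follows, assuming the fusions are available as an ordered list of (level, members_a, members_b) records; that record layout, the function name partition_at_level, and the sample fusion list are hypothetical.

```python
def partition_at_level(fusions, n, k):
    """Apply the first n - k fusions and return the k resulting clusters.

    fusions: ordered list of (level, members_a, members_b), earliest first
    n:       number of objects, labelled 0..n-1
    k:       number of clusters wanted
    """
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    cluster_of = {obj: obj for obj in range(n)}  # object -> cluster label
    for _level, group_a, group_b in fusions[: n - k]:
        target = cluster_of[group_a[0]]
        for obj in group_b:
            cluster_of[obj] = target
    clusters = {}
    for obj, label in cluster_of.items():
        clusters.setdefault(label, []).append(obj)
    return sorted(sorted(group) for group in clusters.values())

# Hypothetical fusion record for four objects.
toy_fusions = [(2, [0], [1]), (4, [2], [3]), (9, [0, 1], [2, 3])]
two_clusters = partition_at_level(toy_fusions, 4, 2)
one_cluster = partition_at_level(toy_fusions, 4, 1)
```

Because only the order of the fusions matters here, a monotonic rescaling of the levels would leave every partition unchanged.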

The  first  attempt  was  at  level  k  =  3.   Here  the 
United  States  appeared  alone  in  one  cluster,  and  the  other 
two clusters contained 41 countries and 43 countries
respectively. Apparently the United States was alone because
of its extreme values in Variables 2 and 3. At this point
the  United  States  was  evaluated  as  an  outlier  and  was  not 
included  in  subsequent  comparisons.   Within  each  of  the  other 
two  clusters,  the  mean  and  standard  deviation  of  each  of  the 
five  variables  were  computed.   After  a  short  perusal  of  these 
20  statistics  it  became  apparent  that  no  interactions  among 
variables  were  highlighted  at  this  level.   Within  four  of  the 
five  Variables,  the  two  means  were  displaced  from  each  other 
by  less  than  the  sum  of  their  standard  deviations. 
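
The comparison just described (two cluster means displaced by less than the sum of their standard deviations) can be sketched as follows. The function names and toy data are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which form was computed.

```python
from statistics import mean, pstdev

def cluster_profiles(data, labels):
    """Return {label: [(mean, std), ...]}, one (mean, std) pair per variable."""
    profiles = {}
    for label in sorted(set(labels)):
        rows = [row for row, lab in zip(data, labels) if lab == label]
        # pstdev (population form) is an assumption; see the note above.
        profiles[label] = [(mean(col), pstdev(col)) for col in zip(*rows)]
    return profiles

def well_separated(profile_a, profile_b):
    """Per variable: True when the means differ by more than the summed stds."""
    return [abs(mean_a - mean_b) > std_a + std_b
            for (mean_a, std_a), (mean_b, std_b) in zip(profile_a, profile_b)]

# Hypothetical data: four countries, two variables, two clusters.
toy = [[1.0, 10.0], [3.0, 12.0], [11.0, 11.0], [13.0, 13.0]]
toy_labels = [0, 0, 1, 1]
profiles = cluster_profiles(toy, toy_labels)
separation = well_separated(profiles[0], profiles[1])
```

In the toy data the first variable separates the two clusters cleanly and the second does not, which is the kind of contrast the perusal of cluster statistics is meant to expose.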

For  the  second  iteration,  level  k  =  7  was  chosen. 
Again  the  pair-wise  displacements  between  means  were  compared 
to  standard  deviations.   The  standard  deviations  were  definitely 
smaller here than they had been at the k = 3 level: within-cluster
homogeneity had improved. Between-cluster heterogeneity
had improved to a lesser extent: many of the means
were well separated but several others were not.

For  the  third  iteration,  level  k  =  8  was  chosen.   Here 
there  were  three  clusters  containing  only  one  country  each. 
All  three  were  dismissed  as  outliers,  leaving  five  clusters 
for  further  study.   Within  each  of  these  five  clusters,  the 
mean  and  standard  deviation  of  each  of  the  five  variables 
were  computed,  using  a  straightforward  computer  program. 
These statistics are listed in Table II and displayed graphically
in Table III. After a relatively brief perusal of
Table III, several possible interactions among the variables
came to mind. Then a quick double-check of Table II confirmed
that four of those possible interactions were probable
interactions.   These  probable  interactions  are  listed  in 
Table  IV. 

None  of  these  "probable  interactions"  was  verified 
mathematically.   The  first  one  would  have  been  relatively 
easy  to  check  out,  by  computing  10  correlation  coefficients. 
But  the  others  would  have  required  considerably  more  ingenuity. 
Since  the  purpose  here  was  to  demonstrate  a  new  application  of 
cluster  analysis  rather  than  to  deduce  substantive  results, 
mathematical  verification  was  considered  beyond  the  scope  of 
this  thesis. 
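
The check mentioned for the first probable interaction requires one correlation coefficient for each pair of the five variables, that is, 10 coefficients. A minimal sketch of that computation follows; it was not part of the thesis work, the Pearson coefficient is an assumed choice, and the function names and toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def pairwise_correlations(data):
    """Return {(p, q): r} for every pair of columns (variables) in `data`."""
    columns = list(zip(*data))
    return {(p, q): pearson(columns[p], columns[q])
            for p, q in combinations(range(len(columns)), 2)}

# Hypothetical data: five countries (rows) by five variables (columns).
toy = [
    [1, 2, 5, 1, 3],
    [2, 4, 4, 3, 1],
    [3, 6, 3, 2, 4],
    [4, 8, 2, 5, 2],
    [5, 10, 1, 4, 5],
]
correlations = pairwise_correlations(toy)
```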

However,  there  was  a  further  step,  within  the  scope 
of  this  thesis,  that  might  have  been  pursued  but  was  not.   It 
would  have  been  logical  to  proceed  next  to  another  cluster 
level (perhaps k = 13) and again look for interactions. Such
reiterations  might  well  confirm  or  refine  the  interactions 
already  deduced,  and  highlight  additional  interactions  as  well. 

B.   SECOND  APPLICATION:   TO  CLASSIFY  COUNTRIES  OBJECTIVELY 
1.   General  Description 

In  this  application,  the  user  presumes  to  understand 
the  variables  used,  and  the  interactions  among  these  variables, 
at  least  on  a  superficial  level.   The  purpose  here  is  not  to 
research  the  variables,  but  rather  to  objectively  classify  the 
countries.   The  previous  application  (to  highlight  interactions 
among  variables)  produced  a  dendrogram  only  as  an  intermediate 
step  before  producing  "cluster  profiles"  as  the  final  product. 
But  the  current  application  seeks  only  the  dendrogram  itself, 
TABLE III

CLUSTER PROFILES, OUTPUT FROM HIGHLIGHTING DEMONSTRATION, GRAPHED

[Graph not reproduced. One panel per variable, each showing the
profiles of the numbered clusters on a scale running from that
variable's minimum to its maximum:
Variable 1. Concentration of Population in Cities (min = 0.0, max = 0.21)
Variable 2. Radios per 1000 Population (max = 1233.5)
Variable 3. Students in Higher Education per Million Population (min = 6.0)
Variable 4. Ethno-Linguistic Fractionalization (min = 0.0, max = 0.926)
Variable 5. Press Freedom Index (max = 3.06)]
TABLE IV

RESULTS  OF  HIGHLIGHTING  DEMONSTRATION: 
PROBABLE  INTERACTIONS  AMONG  VARIABLES 
DEDUCED  AT  CLUSTER  LEVEL   k  =  8 


1.  There  appear  to  be  no  significant  pair-wise  correlations 
(either positive or negative) among the five variables.

2.  A  high  value  in  Variable  3  tends  to  be  accompanied  by  a 
high  value  in  Variable  5.   But  the  inverse  and  converse 
are  not  true  (i.e.,  a  high  value  in  Variable  5  does  not 
imply  a  high  value  in  Variable  3,  and  a  low  value  in 
Variable  3  does  not  imply  a  low  value  in  Variable  5). 

3.  A very high value in Variable 4 tends to be accompanied
by a low value in Variables 1, 2, and 3.

4.  The combination of a high value in Variable 1 and a low
value  in  Variable  4  tends  to  be  accompanied  by  a  high 
value  in  Variable  5. 


For  ready  reference,  the  variable  names  are: 

Variable  1.   Concentration  of  Population  in  Cities,  1965 

Variable  2.   Radios  per  1000  Population,  1965 

Variable  3.   Students  in  Higher  Education  (Third  Level) 
per  One  Million  Population,  1965 

Variable 4.   Ethno-Linguistic Fractionalization

Variable  5.   Press  Freedom  Index,  1965 
to objectively confirm, or perhaps modify, the user's previous,
subjective,  classifications. 

2.   Scenario for Demonstration

People  commonly  think  about  the  countries  of  the  world 
as  members  of  clusters.   They  use  labels  like  "the  Western 
World",  "the  Communist  Bloc",  "the  Have's"  and  "the  Have-not's" 
every  day,  and  they  frequently  hear  more  esoteric  terms  like 
"tri-polar  world",  "five-polar  world"  and  "spheres  of  influence", 
all  of  which  have  classificatory  overtones. 

No  doubt  such  classifications  are  convenient  and  useful; 
but  as  they  exist  now,  many  are  also  subjective  and  confusing. 
When  two  speakers  discuss  the  behavior  of  "the  Communist  Bloc" 
without  first  enumerating  the  members  of  that  bloc,  they  may 
disagree  violently  until  they  discover  that  one  of  them  includes 
Cuba and Chile in his definition but excludes Yugoslavia, while
the  other  has  done  the  reverse.   If  our  classifications  of 
countries  are  useful  but  subjective,  it  would  seem  desirable 
to  make  them  more  objective. 

Imagine  that  a  political  scientist  in  the  State 
Department  wished  to  inject  some  objectivity  into  the  terms 
"Western  World",  "Communist  Bloc",  "Soviet  Bloc",  etc.   His 
first  step  would  probably  be  to  identify  the  several  theories 
(form  of  government,  internal  economic  system,  external 
political  ties,  external  economic  ties,  etc.)  that  are  commonly 
used  to  define  the  terms  in  question.   Then  he  would  probably 
select  one  of  these  theories  for  quantification  and  search  out 
measurable  factors  (preferably  statistics  that  had  already 
been  measured)  with  which  to  express  it. 
For example, imagine that he selected the theory that
external  economic  ties  are  the  prime  mover  in  the  concept  of 
bloc  membership.   Then  his  search  for  measurable  factors  would 
certainly  lead  to  statistics  such  as  level  of  foreign  aid 
received  from  every  other  country,  value  of  imports  received 
from  every  other  country,  and  value  of  exports  sent  to  every 
other  country,  each  of  these  statistics  prorated  against  the 
host  country's  GNP  and/or  population. 

The  final  step  for  our  State  Department  researcher 
would  be  to  combine  these  measurable  factors  mathematically 
so  as  to  output  a  bloc  membership  label  for  each  country  of 
the world; that is, he would write a "factors-to-bloc
transformation". If he were not acquainted with Cluster Analysis,
he might well try to write a single function of the form

bloc membership = F(aid_US, aid_USSR, aid_UK, ...,
                    trade_US, trade_USSR, trade_JAPAN, trade_COM.MKT., ...,
                    GNP, population, etc.)

where  "bloc  membership"  is  a  discrete  variable  which  can  take 
on  three  or  perhaps  five  predetermined  values.   But  such  a 
function would probably be crippled by two weaknesses: excessive
complexity and theoretical inadequacy. The reader can
certainly  visualize  how  complicated  such  a  function  would  have 
to  be  in  order  to  have  broad  applicability.   Moreover,  no 
matter  how  complex  the  function,  it  would  necessarily  ignore 
an  obvious  fact  about  blocs  of  countries:   two  countries  can 
be closely bound in a bloc not by economic dependency on each
other  but  by  their  simultaneous  economic  dependency  on  an 
intermediate  country. 

Cluster Analysis has far more potential as a "factors-to-bloc"
transformation. It does not share the dual weaknesses
of  the  functional  transformation.   First  of  all,  the  ability 
to  group  two  countries  together  through  an  intermediary  is 
intrinsic  to  every  clustering  algorithm  (so  long  as  the 
Complete Linkage-Furthest Neighbor sorting strategy is not used).
And  secondly,  Cluster  Analysis  requires  that  the  user  define 
only  a  transformation  from  measurable  factors  to  a  pair-wise 
Dissimilarity  Coefficient  rather  than  a  transformation  from 
measurable  factors  to  bloc  membership.   Surely  the  former 
should be less complex than the latter.
3.   Choice  of  Data 

To  demonstrate  this  application,  the  author  again 
usurped  the  role  of  State  Department  political  scientist  and 
selected  the  following  statistics  as  measurable  factors  with 
which  to  objectively  classify  countries  into  blocs: 

Variable  1.   Gross  National  Product  per  Capita,  1965 

Variable  2.   Trade  as  percentage  of  Gross  National  Product, 
1965 

Variable 3.   Soviet Aid per Capita, 1954 - 1965

Variable 4.   U.S. Economic Aid per Capita, 1958 - 1965

Values  of  these  variables  for  each  of  85  countries  are  listed 
in  Table  V. 
Here again the unavailability of data made the
demonstration artificial. As was asserted in the preceding
section, the blocking of countries by external economic ties
should depend primarily on pair-wise data. But this researcher
did not have access to any standardized, comprehensive pair-wise
data. Variables 3 and 4 above are pair-wise but not
comprehensive. Foreign aid is provided in substantial amounts
by countries other than the United States and the Soviet Union.
But this researcher could not locate any but the most piecemeal
data on other donors. Variable 2 above is not pair-wise
at all. Pair-wise trade data is collected by the International
Monetary Fund, and their data is both standardized
and reasonably comprehensive. But that data is not made
available to the public in comprehensive form. Without
pair-wise trade data it is virtually impossible to construct
a logical theory for blocking countries by external economic
ties. Nevertheless, this demonstration was carried through
to completion because its purpose is not to deduce substantive
results but to demonstrate a procedure. And that procedure
can still be demonstrated using the foreign aid data (Variables
3 and 4) which is pair-wise, although incomplete.

After completion of the research described here, the
author did obtain access to the IMF data and began a Cluster
Analysis on it. But the results were not obtained in time
to incorporate them in this thesis. See Section II.B.7 for
a description of the work in progress. The data was obtained
through the Inter-University Consortium for Political Research,
on computer tape. The reason why the data is not generally
available was obvious: its sheer magnitude. For purposes of
data collection, the IMF defines 207 countries, and 207
countries taken two at a time produce 21,321 trading combinations.
The complete data file contains almost 500,000 numbers.

4.   Choice of Dissimilarity Coefficient

This  section  describes  the  process  of  converting  the 
data  in  Table  V  to  a  matrix  of  Dissimilarity  Coefficients. 

When Cluster Analysis was used to highlight the interactions
of variables (Section II.A.4 above), the choice of
DC  was  motivated  by  a  desire  to  have  all  five  variables 
weighted  equally,  to  prevent  the  user's  preconceptions  from 
affecting  the  results.   Precisely  the  opposite  is  true  here. 
Here  the  author  presumed  that  he  already  knew  how  the 
variables  interact.   He  wanted  to  incorporate  that  knowledge 
into  the  DC.   The  DC  was  constructed  using  the  following 
rationale:

First  of  all,  it  was  decided  that  the  DC  between  a 
foreign  aid  donor  and  any  other  country  should  be  inversely 
related  to  the  level  of  that  foreign  aid.   Thus,  for  a  first 
cut,  the  formulas 


DC(US,i) = 1 / (1 + usaid_i)   and   DC(SOV,i) = 1 / (1 + sovaid_i)

were  considered.   Next  it  was  observed  that  27  of  the  85 
countries  received  foreign  aid  from  both  the  United  States 
and  the  Soviet  Union.   To  incorporate  relative  dependency 
into  the  formulas,  it  was  decided  to  insert  a  ratio  of  aid 
levels.   Hence  the  following  formulas  were  considered. 


DC(US,i) = (1 + sovaid_i) / (1 + usaid_i)   and   DC(SOV,i) = (1 + usaid_i) / (1 + sovaid_i)

These  formulas  seemed  reasonable  except  that  the  same  level 
of  foreign  aid  has  smaller  impact  on  a  rich  country  than  it 
does on a poor country.  So it was decided to insert GNP_i as
a scaling factor wherever an aid term appeared in either
formula.  But this insertion tended to greatly reduce the
size of the aid terms with respect to the "1" terms.  Therefore
the "1" terms were arbitrarily reduced to "0.1",
producing 


\[
DC(US,i) = \frac{0.1 + \dfrac{sovaid_i}{GNP_i}}{0.1 + \dfrac{usaid_i}{GNP_i}}
\quad \text{and} \quad
DC(SOV,i) = \frac{0.1 + \dfrac{usaid_i}{GNP_i}}{0.1 + \dfrac{sovaid_i}{GNP_i}}
\]


The fact that DC(US,i) and DC(SOV,i) are reciprocals and the
fact that they are dimensionless had intuitive appeal.  The
only apparent shortcomings were the two imposed by unavailability
of data:  aid from other countries is ignored, and
pair-wise  trade  is  ignored.   Although  a  total  trade  figure 
was  available,  there  seemed  to  be  no  logical  way  to  substitute 
it  for  the  missing  pair-wise  figure. 
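
The final pair of formulas can be sketched in a few lines of Python.
This is only an illustration under stated assumptions:  the function
name and argument names are hypothetical, the aid and GNP figures are
assumed to be per-capita values in the same currency units, and the
sample values are invented rather than taken from Table V.

```python
def dc_superpower(usaid, sovaid, gnp, floor=0.1):
    """Return (DC(US,i), DC(SOV,i)) for one country i.

    usaid, sovaid, gnp: per-capita figures in the same currency units
    (an assumption; the text only requires a dimensionless ratio).
    floor: the "0.1" term that replaced the original "1" term.
    """
    us_term = floor + usaid / gnp
    sov_term = floor + sovaid / gnp
    # The two coefficients are reciprocals of each other by construction.
    return sov_term / us_term, us_term / sov_term


# Invented values for illustration only (not from Table V).
dc_us, dc_sov = dc_superpower(usaid=30.0, sovaid=0.0, gnp=300.0)
```

A country receiving only United States aid obtains a DC(US,i) below
one and a DC(SOV,i) above one, which is the intended direction.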

At  this  point  it  was  verified  that  the  DC(US,i) 
formula  would  apply  to  every  country  dyad  in  which  the  United 
States  is  a  member,  except  for  the  United  States  -  Soviet  Union 
dyad.  Similarly, the DC(SOV,i) formula applies to every
country dyad in which the Soviet Union is a member, except
for the United States - Soviet Union dyad.  Thus it remained
to construct formulas for the United States - Soviet Union
dyad and for all dyads in which neither the United States nor
the Soviet Union is a member.  Hopefully the same formula
would apply to both.

But  here  the  lack  of  pair-wise  trade  data  was  really 
crippling.  The only pair-wise economic ties of any significance
involved pair-wise trade.  The only logical formula
necessarily  involved  the  inverse  of  pair-wise  trade.   There 
seemed  no  natural  way  to  use  the  available  data  on  total 
trade.   Finally,  in  desperation  it  was  rationalized  that  a 
country  whose  foreign  trade  is  large  with  respect  to  its  GNP 
tends  to  have  closer  ties  with  another  country  in  the  same 
situation.   It  was  decided  that  the  nucleus  of  the  formula 
should  be 


\[
DC(i,j) \sim \left| trade_i - trade_j \right|
\]


But  because  of  the  weak  theory  here  as  compared  to  the 
rigorous  formulas  for  DC(US,i)  and  DC(SOV,i)  it  was  decided 
to  diminish  the  effect  of  trade  difference  when  foreign  aid 
recipients  are  involved.   Hence  it  was  decided  to  expand  the 
formula  to 


[Equation not recoverable from the source text.  Its legible terms
are |trade_i - trade_j| and the aid sum
sovaid_i + usaid_i + sovaid_j + usaid_j.]

The dearth of theoretical foundation here is admitted.  It
casts suspicion on the values in the DC matrix and on the
dendrogram finally obtained.  But the reader is again reminded
that  the  purpose  here  is  to  demonstrate  the  procedure,  not  to 
deduce  substantive  results. 

Turning  from  the  substance  to  the  procedure,  there  is 
a  significant  departure  from  normal  Cluster  Analysis  procedure, 
taken above, that warrants explanation.  Two completely different
DC formulas have been developed, one to be used when the
United States or Soviet Union is a dyad member and another
to be used the rest of the time.  The mathematical significance
of this duality is that the DC formulas, taken collectively,
produce gross violations of the metric inequality, which is

\[
DC(a,b) + DC(b,c) \geq DC(a,c) \quad \text{for all } a, b, c
\]

Generally,  it  is  desirable  although  not  essential  that  a  matrix 
of  DCs  satisfy  the  metric  inequality.   When  they  do  not,  the 
clustering  algorithm  can  be  expected  to  produce  high 
"distortion"  between  the  matrix  of  DCs  and  the  dendrogram. 
(Loosely  defined,  "distortion"  is  the  difference  between 
DC(i,j)  and  the  level  at  which  country  i  and  country  j  cluster 
together in an agglomerative algorithm.)  But of what significance
is high distortion?  The word carries derogatory
connotations, but is distortion really undesirable in Cluster
Analysis?  This author maintains that it depends on the purpose
of the clustering.  In the "highlighting of variables" application,
distortion was not desirable:  figuratively speaking,
each country had been plotted in five-dimensional space and
the  clustering  algorithm  was  searching  for  natural  clusters, 
as  plotted.   But  in  this  "objective  classifying  of  countries" 
application,  distortion  is  natural:   the  original  pair-wise 
similarities  specified  in  the  DC  matrix  cannot  be  expected 
to  be  representable  in  Euclidean  space,  and  during  the 
clustering  it  is  desired  that  these  original  similarities  be 
affected  by  intermediate  countries.   With  this  reasoning,  it 
is asserted that violation of the metric inequality is necessary
and that the use of two or more DC formulas is acceptable.
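
Whether a given matrix of DCs violates the metric inequality can be
checked directly.  The following is a minimal sketch, assuming the
matrix is held as a symmetric list of lists;  the function name and
the three-object matrix are hypothetical and serve only as an
illustration.

```python
from itertools import permutations


def metric_violations(dc):
    """List triples (a, b, c) with DC(a,b) + DC(b,c) < DC(a,c).

    dc: square symmetric list-of-lists of dissimilarities.
    """
    n = len(dc)
    return [(a, b, c) for a, b, c in permutations(range(n), 3)
            if dc[a][b] + dc[b][c] < dc[a][c]]


# Invented matrix: objects 0 and 2 are far apart but both close to 1.
toy = [[0.0, 1.0, 5.0],
       [1.0, 0.0, 1.0],
       [5.0, 1.0, 0.0]]
violations = metric_violations(toy)
```

Each violation is reported once per ordering of its end points, so the
single offending triple in the invented matrix appears twice.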

Using the formulas developed above, the 340 data points
in Table V were transformed to a matrix of Dissimilarity
Coefficients.

5.   Choice  of  Algorithm 

Here,  as  in  the  highlighting  demonstration,  the  first 
decision point was to choose among the agglomerative, divisive
and  reallocative  algorithms.   Again  the  divisive  algorithms 
were  discarded  because  they  are  not  as  well  documented  as 
their  agglomerative  counterparts.   The  reallocative  algorithms 
did  not  apply  because  they  require  that  the  DC  be  a  metric. 
Hence  the  agglomerative  algorithm  was  selected. 

The  final  decision  was  to  select  a  sorting  strategy. 
The  Complete  Linkage-Furthest  Neighbor  strategy  was  eliminated 
from  consideration  here;  it  does  not  permit  any  chaining, 
which  is  desirable  in  this  application.   On  the  other  end  of 
the  spectrum,  the  Single  Linkage-Nearest  Neighbor  sorting 
strategy  maximizes  chaining,  often  to  the  extent  that  natural 
clusters are obscured.  For this application it was determined
to use one of the compromise sorting strategies.  Among
these,  Group  Average  sorting  seemed  to  correspond  with  the 
concepts  of  bloc  membership,  ratios  of  foreign  aid,  etc.   It 
is  expressed  mathematically  as 


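\[
DC(k,ij) = \frac{n_i}{n_i + n_j}\, DC(k,i) + \frac{n_j}{n_i + n_j}\, DC(k,j)
\]

where n_i and n_j are the numbers of countries in clusters i and j.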


The  dendrogram  in  Drawing  2  was  obtained  using  the 
Group  Average  sorting  strategy  in  an  agglomerative  algorithm. 
But once again the reader is cautioned that the results are
suspect.
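
The agglomerative procedure with Group Average sorting can be
sketched as follows.  This is not the program listed at the end of
this thesis;  it is a minimal illustration, assuming the DC matrix is
a symmetric list of lists, and the function name and the format of
the returned merge list are hypothetical.

```python
def group_average_clustering(dc):
    """Agglomerative clustering with the Group Average update.

    dc: square symmetric list-of-lists of dissimilarities.
    Returns a list of merges (members_a, members_b, level), in merge order.
    """
    n = len(dc)
    members = {i: [i] for i in range(n)}          # active clusters
    dist = {(i, j): dc[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    merges = []
    while len(members) > 1:
        # Merge the pair of active clusters with the smallest dissimilarity.
        (a, b), level = min(dist.items(), key=lambda kv: kv[1])
        na, nb = len(members[a]), len(members[b])
        new = next_id
        next_id += 1
        for k in members:
            if k in (a, b):
                continue
            d_ka = dist.pop((min(k, a), max(k, a)))
            d_kb = dist.pop((min(k, b), max(k, b)))
            # DC(k,ab) = n_a/(n_a+n_b) DC(k,a) + n_b/(n_a+n_b) DC(k,b)
            dist[(k, new)] = (na * d_ka + nb * d_kb) / (na + nb)
        del dist[(a, b)]
        merges.append((sorted(members[a]), sorted(members[b]), level))
        members[new] = members.pop(a) + members.pop(b)
    return merges
```

Each entry of the returned list records which two clusters were joined
and the cluster level at which they joined, which is the information
a dendrogram displays.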

6.   Explanation of Dendrogram

Despite  the  admitted  artificiality  of  the  results 
obtained  here,  the  layman  might  appreciate  an  explanation  of 
the  information  available  in  any  dendrogram  produced  by 
Cluster  Analysis. 

The  key  to  reading  a  dendrogram  is  the  concept  of 
"cluster  level."   By  merely  specifying  a  cluster  level,  the 
following  information  can  be  read  from  the  dendrogram:   the 
number of clusters and the countries contained in each cluster.
That is, there is a correspondence from cluster level to a
partition  of  the  countries. 

The  scale  at  the  bottom  of  Drawing  2  is  a  cluster 
level  scale.   Note  that  the  minimum  value  of  cluster  level  is 
0.0  at  the  far  left  and  the  maximum  value  is  20.0  at  the  far 
right.  A low cluster level specifies a partition having many
small clusters, while a high cluster level specifies a partition
having a few large clusters.  Thus cluster level can be
thought  of  as  a  measure  of  the  largest  dissimilarity  (or, 
equivalently,  the  weakest  bond)  present  within  any  cluster  in 
the  partition. 

For  example,  consider  cluster  level  0.0,  the  minimum 
observed  cluster  level  in  Drawing  2.   At  cluster  level  0.0, 
the  85  countries  are  partitioned  into  77  clusters.   Seventy- 
two  of  these  77  contain  only  a  single  country.   Four  of  the  77 
contain  exactly  two  countries.   And  one  cluster  contains  5 
countries:   Canada,  Ireland,  Switzerland,  Sweden  and  Denmark. 
Since 0.0 is the minimum observed cluster level, we may conclude
that  the  strongest  possible  bonds  exist  within  every  cluster. 
Specifically,  we  may  conclude  that  Canada,  Ireland,  Switzerland, 
Sweden  and  Denmark  are  bound  together  by  the  tightest  possible 
economic  ties.   Our  mathematical  model  will  not  separate  them 
even  at  the  lowest  cluster  level. 

Consider  next  a  slightly  higher  cluster  level,  say  1.3. 
Here  we  are  permitting  slightly  weaker  bonds  to  be  present 
within clusters.  We find that the 85 countries are here
partitioned into 52 clusters.  Thirty-three of those clusters
contain  a  single  country,  twelve  contain  exactly  two  countries, 
five  contain  exactly  three  countries,  one  contains  four 
countries,  and  one  contains  nine  countries.   In  the  nine-country 
cluster,  Canada,  Ireland,  Switzerland,  Sweden  and  Denmark  have 
been  joined  by  New  Zealand,  South  Africa,  France  and  Australia. 
We may conclude that slightly weaker economic ties bind the
four  new  countries  to  the  original  five. 

Similar  inferences  can  be  drawn  from  any  dendrogram 
produced  by  Cluster  Analysis. 
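
The correspondence from cluster level to partition can be made
concrete with a short sketch.  It assumes the dendrogram is stored as
a list of merges, each giving the members of the two clusters joined
and the level at which they joined;  that storage format, the function
name and the five-object example are all hypothetical.

```python
def partition_at(n, merges, cluster_level):
    """Return the partition of objects 0..n-1 implied by a cluster level.

    merges: list of (members_a, members_b, level) tuples.
    A merge is applied when its level does not exceed cluster_level.
    """
    label = list(range(n))                 # each object starts alone
    for members_a, members_b, level in merges:
        if level <= cluster_level:
            target = label[members_a[0]]
            old = label[members_b[0]]
            # Relabel every object of the second cluster into the first.
            label = [target if x == old else x for x in label]
    clusters = {}
    for obj, lab in enumerate(label):
        clusters.setdefault(lab, []).append(obj)
    return sorted(clusters.values())


# Invented dendrogram on five objects.
example = [([0], [1], 0.0), ([2], [3], 1.3), ([0, 1], [2, 3], 5.0)]
low = partition_at(5, example, 0.0)
high = partition_at(5, example, 20.0)
```

Raising the cluster level can only join clusters, never split them,
so the partitions read from one dendrogram are nested.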

7.   Work in Progress

Throughout  this  second  demonstration  of  Cluster 
Analysis  it  has  been  emphasized  that  the  unavailability  of 
pair-wise  trade  data  made  the  demonstration  artificial.   But 
this  artificiality  can  soon  be  removed.   Pair-wise  trade  data 
was recently provided to this author through the Inter-University
Consortium for Political Research [Ref. 5].  Time
will  not  permit  this  author  to  complete  a  Cluster  Analysis 
on  the  data,  but  if  another  researcher  chooses  to  undertake 
it,  the  following  plan  of  attack  is  suggested. 

a.   Step  1  -  Reduce  data  file  to  manageable  size 

The  ICPR  data  file  contains  approximately  333,720 
pairwise  trade  data:   annual  trade  values,  in  millions  of 
U.S. dollars, for the years 1958 through 1968, among 207 different
"countries."  Many of these "countries" are actually
colonies  and  many  others  have  negligible  foreign  trade  except 
with  a  single  "sponsor  country."   The  logical  first  step  is 
to  selectively  reduce  the  size  of  the  data  file  by  eliminating 
the  insignificant  "countries",  and  by  selecting  a  single  year 
and  eliminating  the  other  nine.   It  is  recommended  that  all 
"countries"  be  eliminated  except  the  136  nations  having  a 
population  of  one  million  or  more  and  those  smaller  nations 
having  membership  in  the  United  Nations  as  of  1968.   These 
136 nations are listed on pages 1 through 4 of Ref. 8.  It is
further  recommended  that  the  year  1967  be  used  and  the  rest 
be  eliminated  temporarily.   The  author  has  determined  that, 
through  the  first  one-sixth  of  the  file,  1967  has  fewer  zero 
entries  than  any  other  year  (a  zero  entry  signifies  either 
trade  less  than  100,000  dollars  or  missing  data).   This 
selective  reduction  of  the  data  should  reduce  the  file  length 
to  about  one  eighth  its  original  length. 
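
Because Step 1 needs only one sequential pass, it can be sketched as a
simple filter.  The record layout used here (a dictionary with the
keys 'year', 'exporter', 'importer', 'reporter' and 'value') is a
hypothetical stand-in;  the actual layout of the ICPR file is not
described in this thesis.

```python
def reduce_records(records, keep_countries, keep_year):
    """Single sequential pass: keep one year and the retained nations only.

    records: iterable of dicts with hypothetical keys
             'year', 'exporter', 'importer', 'reporter', 'value'.
    """
    keep = set(keep_countries)
    for rec in records:
        if (rec["year"] == keep_year
                and rec["exporter"] in keep
                and rec["importer"] in keep):
            yield rec
```

Written as a generator, the filter never holds more than one record in
memory, which suits a file read once from beginning to end.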

b.   Step  2  -  Sort  and  combine  data 

Preparatory  to  sorting  the  data,  the  reduced  data 
file  should  be  stored  on  either  a  disk  or  a  data  cell  rather 
than  magnetic  tape.   The  ICPR  normally  provides  the  data  on 
tape,  and  tape  is  a  satisfactory  input  to  the  data  reduction 
process  in  step  1  because  that  process  can  be  sequential, 
reading  the  file  once  from  beginning  to  end.   However,  the 
sorting  process  about  to  be  described  cannot  read  the  file 
sequentially,  and  magnetic  tape  is  a  very  inefficient  input 
to  processes  that  must  search  the  data. 

The  ICPR  data  file  does  not  list  one  trade  figure 
per  country  dyad  per  year.   It  lists  up  to  four  figures, 
namely, 

1.  Value  of  exports  from  i  to  j,  as  reported  by  i 

2.  Value of exports from i to j, as reported by j

3.  Value  of  exports  from  j  to  i,  as  reported  by  j 

4.  Value  of  exports  from  j  to  i,  as  reported  by  i 

Hopefully  numbers  1  and  2  are  approximately  equal  and  numbers 
3 and 4 are approximately equal.  If so, then total trade
between i and j is the sum of 1 and 3.  It is recommended
that  this  approximate  equality  be  assumed  for  the  initial  run 
of  this  "sort  and  combine"  process.   Then  the  process  is 
simple:   search  the  file  for  the  first  record  involving  the 
i-j  dyad;  identify  it  with  respect  to  direction  of  trade, 
regardless  of  reporting  country;  continue  searching  for  the 
second  record  involving  the  i-j  dyad;  identify  it  with  respect 
to  direction;  if  the  directions  are  opposite  then  sum  the  two 
values  and  store  them;  if  the  directions  are  the  same  then 
ignore  the  second  value  and  continue  searching  for  the  third 
record;  and  so  on.   The  reason  why  shortcuts  are  in  order  for 
the  initial  run  is  that  this  "sort  and  combine"  process  will 
have  to  be  performed  9180  times  (136  countries,  taken  two  at 
a time, yields 9180 different combinations).
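
The "sort and combine" process can be sketched as follows, assuming
the reduced file fits in memory so that a table keyed by direction of
trade replaces the repeated searches.  The tuple layout and function
name are hypothetical.  One departure from the text is labeled in the
code:  a dyad reported in only one direction is kept with that single
value rather than dropped.

```python
from math import comb

N_DYADS = comb(136, 2)   # 9180, the number of dyads quoted in the text


def combine_dyads(records):
    """Total trade per unordered dyad, using the first figure seen per direction.

    records: iterable of (exporter, importer, reporter, value) tuples
             (a hypothetical layout).
    """
    first_seen = {}                       # (exporter, importer) -> value
    for exporter, importer, _reporter, value in records:
        # A second report of the same flow is ignored, as in the text.
        first_seen.setdefault((exporter, importer), value)
    totals = {}
    for (exporter, importer), value in first_seen.items():
        dyad = tuple(sorted((exporter, importer)))
        # Assumption: a one-directional dyad keeps its single value.
        totals[dyad] = totals.get(dyad, 0.0) + value
    return totals
```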

c.   Step  3  -  Choose  a  Dissimilarity  Coefficient 

The  following  formula  is  recommended  as  a  DC,  at 
least  initially: 

\[
DC(i,j) = \frac{1}{1 + Trade_{ij}}
\]


More  elaborate  formulas  can  be  developed  later  by  incorporating 
the rationale in Section II.B.4 of this thesis.
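
A minimal sketch of this initial DC follows.  It assumes total trade
per dyad is held in a dictionary keyed by the sorted pair of country
names, and it treats a missing dyad as zero trade;  both the data
structure and that treatment are assumptions, as is the function name.

```python
def trade_dc_matrix(countries, totals):
    """Build DC(i,j) = 1 / (1 + Trade_ij) as a symmetric list-of-lists.

    totals: dict mapping a sorted (i, j) tuple to total trade; a missing
            dyad is treated as zero trade (an assumption).
    """
    n = len(countries)
    dc = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(x + 1, n):
            dyad = tuple(sorted((countries[x], countries[y])))
            value = 1.0 / (1.0 + totals.get(dyad, 0.0))
            dc[x][y] = dc[y][x] = value
    return dc
```

With this formula the DC lies between zero and one, and a dyad with no
recorded trade receives the maximum value of one.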

d.   Step  4  -  Choose  a  Clustering  Algorithm 

It  is  recommended  that  an  agglomerative  algorithm 
with  Group  Average  sorting  strategy  be  used,  for  the  same 
reasons that it was selected in Section II.B.5 above.  This
involves making the following additions and substitutions in
the computer program listed at the end of this thesis:

Immediately before   DO 73 E=1,N   insert the following two
statements:

RATA  =  S(A,A)/(S(A,A)+S(B,B)) 
RATB  =  S(B,B)/(S(A,A)+S(B,B)) 

In  place  of 

DS(E)  =  AMAX1(S(E,A),S(E,B)) 
substitute 

DS(E)  =  RATA*S(E,A)  +  RATB*S(E,B) 
Similarly,  in  place  of 

70  DS(E)  =  AMAX1(S(A,E),S(B,E)) 
substitute 

70  DS(E)  =  RATA*S(A,E)  +  RATB*S(B,E) 
And  finally,  in  place  of 

71  DS(E)  =  AMAX1(S(E,A),S(B,E)) 
substitute 

71  DS(E)  =  RATA*S(E,A)  +  RATB*S(B,E) 

III.   CONCLUSION

This  thesis  has  demonstrated  two  potential  uses  of  Cluster 
Analysis  in  which  the  nations  of  the  world  are  treated  as 
measurable  objects.   The  substantive  results  obtained  in  each 
demonstration  are  not  presented  as  conclusions;  they  were 
derived  incidentally  while  demonstrating  methods.   It  is 
asserted that the two uses illustrated here, markedly different
in several respects, are representative of a wide range of
applications  for  Cluster  Analysis  in  the  fields  of  political 
science  and  international  relations.   Although  Cluster  Analysis 
was  developed  for  the  physical  sciences  and  has  so  far  received 
scant  attention  outside  that  context,  it  is  readily  adaptable 
to  the  social  sciences.   In  particular,  it  is  extremely  well 
suited  to  model  building  and  statistical  analysis  involving 
the  nations  of  the  world.   As  such,  it  warrants  the  attention 
of  the  U.S.  State  Department. 

APPENDIX A


DATA 


Except  for  three  data  points,  all  data  used  in  this  thesis 
were  made  available  by  the  Inter-University  Consortium  for 
Political  Research.   The  data  were  originally  collected  by 
Charles  Lewis  Taylor  and  Michael  C.  Hudson.   Neither  the 
original  collectors  of  the  data  nor  the  consortium  bear  any 
responsibility  for  the  analysis  or  interpretations  presented 
here.

Following are the precise definitions of the nine variables
used in this thesis.  All definitions are extracted
verbatim  from  Ref.  8. 

Variable name:   Concentration of Population in Cities, 1965
Definition:   Concentration is defined as:   the sum over all
cities of the squares of the proportion of the total population
residing in each city.  Concentration is higher the fewer
cities  and  the  greater  the  size  of  the  largest  city  relative 
to  the  total  population.  [Ref.  8,  p.  16] 
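
As an illustration of this definition only, the index can be computed
as below.  The function name and the sample figures are hypothetical
and are not taken from Ref. 8.

```python
def concentration(city_populations, total_population):
    """Sum over all cities of the squared share of the total population."""
    return sum((p / total_population) ** 2 for p in city_populations)


# Invented figures: two cities holding 50% and 30% of the population.
example_concentration = concentration([500.0, 300.0], 1000.0)
```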

Variable  name:   Radios  per  1000  Population,  1965 
Definition:   Figures  relate  to  all  types  of  receivers  including 
those  connected  to  a  re-distribution  system.   They  relate 
either  to  the  number  of  licenses  issued  or  sets  declared  or 
to  the  estimated  number  of  receivers  in  use.   In  many  countries 
a license may cover more than one receiver in the same household.
Data exclude television sets. [Ref. 8, p. 32]

Variable name:   Students in Higher Education (Third Level)
per  One  Million  Population,  1965. 

Definition:   Data  refer  to  the  enrollment  in  all  institutions 
of  education  at  the  third  level,  i.e.,  degree  granting  and 
non-degree  granting  institutions  of  both  private  and  public 
higher  education  of  all  types.   These  include  universities, 
higher  technical  schools,  teacher  training  schools,  theological 
schools,  etc.   As  far  as  possible  part  time  students  are 
included  in  the  figures  but  correspondence  courses  and  auditors 
are generally excluded. [Ref. 8, p. 41]

Variable  name:   Ethno-Linguistic  Fractionalization 
Definition:   The  main  source  for  this  variable  (Atlas  Narodov 
Mira)  makes  little  distinction  between  ethnic  and  linguistic 
differences  in  its  definition  and  collection  of  data.   Groups 
are  determined  not  by  their  physical  characteristics  but  by 
their  roles,  their  descents  and  their  relationships  to  others. 
An  index  of  fractionalization  calculated  upon  data  from  Atlas 
does  correlate  highly  with  a  similar  index  calculated  upon 
linguistic  data  from  other  sources,  but  not  quite  highly 
enough  to  be  considered  the  same  indicator.   Other  sources 
used here report only linguistic data.  Index of fractionalization
was calculated by the following formula:

\[
F = 1 - \sum_i \left( \frac{N_i}{N} \right) \left( \frac{N_i - 1}{N - 1} \right)
\]

where N_i = number of people in the ith group
and   N = total population   [Ref. 8, p. 46]
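
A minimal sketch of the index follows, assuming the sum runs over all
groups i.  The function name and the sample group sizes are
hypothetical and are not taken from Ref. 8.

```python
def fractionalization(group_sizes):
    """F = 1 - sum_i (N_i / N) * ((N_i - 1) / (N - 1))."""
    total = sum(group_sizes)
    return 1.0 - sum((n / total) * ((n - 1) / (total - 1))
                     for n in group_sizes)


# Invented populations: one homogeneous, one split evenly in two.
f_single = fractionalization([100])
f_split = fractionalization([50, 50])
```

A population consisting of a single group gives F = 0, and F
approaches 1 as the population divides into many small groups.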

Variable name:   Press Freedom Index, 1965

Definition:   This  index,  created  by  the  School  of  Journalism, 
University of Missouri, is "designed to measure the independence
of a nation's broadcasting and press system and its
ability  to  criticize  its  own  local  and  national  governments." 
The  index  is  comprised  of  the  judgements  of  panels  of  native 
and  foreign  newsmen  on  23  aspects  of  the  press  (e.g.,  extent 
of  legal  controls,  licensing,  government  ownership,  criticism 
and  censorship).   For  a  fuller  description,  see  Ralph  L. 
Lowenstein,  "PICA  (Press  Independence  and  Critical  Ability) 
Index:   Measuring World Press Freedom," University of Missouri,
School  of  Journalism  Freedom  of  Information  Center  Publication 
#166  (August,  1966).   The  index,  which  consists  of  averages 
of the judges' scores, has a range from -4.00 for less freedom
to  +4.00  for  more.  [Ref.  8,  p.  116] 

Variable  name:   Gross  National  Product  per  Capita,  1965 
Definition:   This  variable  was  derived  by  dividing  Gross 
National Product in millions of U.S. dollars by total population
in thousands.  Gross National Product is reported in
constant  U.S.  dollars  and  refers  to  gross  national  product 
even  for  countries  which  normally  report  their  national 
accounts  in  terms  of  net  material  product  or  other  concepts. 
[Ref.  8,  p.  65] 

Variable name:   Trade as percentage of Gross National Product,
1965.
Definition:   This variable was derived by dividing total trade
(imports plus exports, merchandise only) by Gross National
Product. [Ref. 8, p. 69]

Variable name:   Soviet Aid per Capita, 1954 - 1965
Definition:   This variable was derived by dividing total
Soviet aid by total population.  Total Soviet aid data refer
to Soviet economic credits and grants to countries in terms
of thousand U.S. dollars for the period 1954 - 1965.
[Ref.  8,  p.  107] 

Variable  name:   U.S.  Economic  Aid  per  Capita,  1958  -  1965 
Definition:   This  variable  was  derived  by  dividing  total 
U.S.  economic  aid  by  total  population.   Total  U.S.  economic 
aid  data  refer  to  grants  and  loans  and  are  given  in  millions 
of  U.S.  dollars  for  the  period  July  1,  1958  through  June  30, 
1965.  [Ref.  8,  p.  107] 

The  three  data  points  not  provided  by  the  ICPR  are  listed 
below.   The  ICPR  data  file  listed  all  three  as  missing  data. 
But in each case this author preferred to introduce an approximate
(or even erroneous) value rather than eliminate the
particular  country  from  the  Cluster  Analysis.   Hence  the  three 
values  were  estimated  in  the  manner  specified.   Note  that  no 
two  estimations  involved  the  same  country.   All  countries 
missing  two  or  more  data  (among  the  nine  variables  used)  in 
the  ICPR  data  file  were  omitted  from  the  Cluster  Analysis  at 
the  outset. 

Country:   Chile
Variable name:   Radios per 1000 Population, 1965
Estimated value:   240.0
Method of estimation:   Average of values for Peru and Argentina.

Country:   Chad
Variable name:   Students in Higher Education (Third Level) per
One Million Population, 1965
Estimated value:   230.0
Method of estimation:   Average of values for Mali, Upper Volta,
Sudan and Cameroon.

Country:   Zambia
Variable name:   Students in Higher Education (Third Level) per
One Million Population, 1965
Estimated value:   170.0
Method of estimation:   Average of seventeen neighboring
countries.